revision: use priority queue for streaming walks by spkrka · Pull Request #2127 · gitgitgadget/git

spkrka · 2026-05-27T14:40:07Z

This is a follow-up to kk/limit-list-optim (now in master), which
replaced the O(N) sorted linked-list insertion in limit_list() with
a priority queue. In the review thread for that patch, I
mentioned that the same approach could be applied to the streaming
(non-limited) walk in get_revision_1(). Junio suggested doing it
as a separate topic, and Peff noted he had a local branch
(jk/revs-commits-prio-queue) doing a broader conversion of
revs->commits to prio_queue entirely.

This series takes a lighter-weight approach: it keeps the linked
list for setup and external callers, and adds a separate
commit_queue field that the streaming walk drains into on first use.
This avoids touching bisect, line-log, and list-objects code, at
the cost of a "only one should be non-empty" invariant between the
two fields.

Together with the limit_list() change already in master, this
eliminates all commit_list_insert_by_date() callers from
revision.c.

Patch 1 is a small leak fix -- a missing release_revisions() call
in pack-objects that becomes visible once the commit queue uses a
dynamically allocated prio_queue array.

Patch 2 introduces a rev_walk_mode enum to replace the repeated
if/else chains in get_revision_1(). The function dispatches on
walk mode in multiple places (next commit, expand parents, flag
clearing) and these chains must stay in sync. The enum resolves
the mode once and both dispatch sites switch on the same variable.
This is a lighter alternative to the vtable-based refactoring I
mentioned before. No functional change.

Patch 3 is the actual conversion of the streaming walk to use a
priority queue.

== Why this helps ==

The streaming walk in get_revision_1() inserts newly discovered
parent commits into a date-sorted queue. On master, this uses
commit_list_insert_by_date(), which walks the linked list to find
the insertion point -- O(w) per insert, where w is the queue width
(active walk frontier).

In merge-heavy repositories, the walk frontier stays wide:

Repository Commits Peak width Avg width

monorepo (2.4M) 2,420K 2,653 1,700
linux.git 1,445K 581 235
git.git 82K 188 82

On the monorepo, each of the 2.4M commits requires scanning an
average of 1,700 list entries to find the insertion point. With
the priority queue, this drops to ~11 heap comparisons.

== Benchmarks ==

All benchmarks: best of 3 runs, same machine, commit-graph
present.

Streaming walks (affected by this series):

git rev-list --count HEAD (monorepo, 2.4M commits)

  master:   17.94s
  patched:   3.38s   (5.3x faster)

git rev-list HEAD (monorepo, full output)

  master:   27.72s
  patched:   8.61s   (2.8x faster, I/O-bound fraction unchanged)

Regression checks -- non-merge-heavy repos (streaming path,
but frontier stays narrow so O(w) insertion was never the
bottleneck):

git rev-list --count HEAD (linux.git, 1.4M commits)

  master:    1.76s
  patched:   1.81s   (no change)

git rev-list HEAD (linux.git, full output)

  master:    4.46s
  patched:   4.52s   (no change)

git rev-list --count HEAD (git.git, 82K commits)

  master:     83ms
  patched:    86ms   (no change)

Regression checks -- other walk modes (not affected by this
series):

git rev-list --count HEAD~5000...HEAD (monorepo, limited path)

  master:    7.36s
  patched:   7.02s   (no change)

== Profile breakdown ==

perf profiling of rev-list --count HEAD on the monorepo shows
where the time goes:

master (17.94s):
commit_list_insert_by_date 79% 14.25s
fixed overhead (parse/lookup) 21% 3.69s

patched (3.38s):
heap ops (compare + sift) 16% 0.53s
fixed overhead (parse/lookup) 84% 2.85s

The queue maintenance itself sped up 27x (14.25s to 0.53s).
The overall 5.3x is lower because the fixed costs -- object
lookup (17%), commit-graph parsing (14%), memory allocation
(10%) -- are roughly constant between the two versions at ~3s.

This means the patch removes the dominant bottleneck entirely.
After the patch, the walk cost is dominated by irreducible
per-commit work (parsing and object lookup) which scales
linearly with commit count regardless of frontier width.

enumerate_and_traverse_cruft_objects() initializes a rev_info on the stack but never calls release_revisions() afterwards. This is not visible on master but becomes a leak once the revision walking machinery uses dynamically allocated structures. Add the missing release_revisions() call. Signed-off-by: Kristofer Karlsson <krka@spotify.com>

get_revision_1() dispatches to different walk strategies based on a combination of rev_info flags: reflog_info, topo_walk_info, and limited. These conditions are checked in multiple places within the function -- once to select the next commit, and again to decide how to expand parents -- and the two chains must stay in sync. Extract the mode selection into a rev_walk_mode enum and a small get_walk_mode() helper, resolved once at the top of get_revision_1(). Both dispatch sites now switch on the same mode variable, making it obvious that they agree and easier to verify that all modes are handled. No functional change. Signed-off-by: Kristofer Karlsson <krka@spotify.com>

The streaming (non-limited) walk in get_revision_1() inserts newly discovered parent commits into a date-sorted queue via commit_list_insert_by_date(), which scans the linked list to find the insertion point -- O(w) per insert, where w is the width of the active walk frontier. Replace this with an O(log w) priority queue. Add a commit_queue field to rev_info alongside the existing commits linked list. The two representations are mutually exclusive: setup and external callers that need list access use the linked list, then get_revision_1() lazily drains it into the priority queue on first call. Add a REV_WALK_NO_WALK enum value to distinguish the no_walk case (which still uses the commit list) from the streaming case. The conversion function rev_info_commit_list_to_queue() is public so callers that know they will iterate can convert early. Combined with the limit_list() priority queue change already in master, this eliminates all O(w) sorted linked-list insertion from the revision walk machinery. Signed-off-by: Kristofer Karlsson <krka@spotify.com>

spkrka · 2026-05-27T15:48:54Z

/submit

gitgitgadget · 2026-05-27T15:50:24Z

Submitted as pull.2127.git.1779897003.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2127/spkrka/streaming-prio-queue-v1

To fetch this version to local tag pr-2127/spkrka/streaming-prio-queue-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2127/spkrka/streaming-prio-queue-v1

spkrka added 3 commits May 27, 2026 17:11

spkrka force-pushed the streaming-prio-queue branch from 58004d9 to c0e7870 Compare May 27, 2026 15:19

spkrka marked this pull request as ready for review May 27, 2026 15:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

revision: use priority queue for streaming walks#2127

revision: use priority queue for streaming walks#2127
spkrka wants to merge 3 commits into
gitgitgadget:masterfrom
spkrka:streaming-prio-queue

spkrka commented May 27, 2026 •

edited

Loading

Uh oh!

spkrka commented May 27, 2026

Uh oh!

gitgitgadget Bot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

spkrka commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Repository Commits Peak width Avg width

Uh oh!

spkrka commented May 27, 2026

Uh oh!

gitgitgadget Bot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

spkrka commented May 27, 2026 •

edited

Loading